Customer churn (or customer attrition) is a problem for any business in the service industry: you only make money by keeping customers interested in your product. In the financial services industry this usually takes the form of credit cards, so the more people that use a bank's credit card service, the more money the bank makes. Being able to determine which customers are most likely to drop their credit card lets the bank reach out to those customers and address their problems before they leave. This could give the bank a competitive advantage in the marketplace by keeping more customers on its credit card than its competitors.
Download Location: https://www.kaggle.com/sakshigoyal7/credit-card-customers
import numpy as np
import pandas as pd
import seaborn as sb
import scikitplot as skplt
# imblearn Libraries
from imblearn.over_sampling import SMOTE
from imblearn import __version__ as imbv
# scipy Libraries
from scipy.stats import norm
from scipy import __version__ as scipv
# matplotlib Libraries
import matplotlib.pyplot as plt
from matplotlib import __version__ as mpv
# plotly Libraries
import plotly.express as ex
from plotly import __version__ as pvm
# sklearn Libraries
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn import __version__ as skv
from xgboost.sklearn import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer, recall_score, confusion_matrix
# Library Versions
print('Using version %s of scipy' % scipv)
print('Using version %s of pandas' % pd.__version__)
print('Using version %s of numpy' % np.__version__)
print('Using version %s of plotly' % pvm)
print('Using version %s of imblearn' % imbv)
print('Using version %s of sklearn' % skv)
print('Using version %s of seaborn' % sb.__version__)
print('Using version %s of matplotlib' % mpv)
bankData = pd.read_csv('BankChurners.csv')
print("The dimension of the data is: {:,} (rows) by {:,} (columns)".format(bankData.shape[0], bankData.shape[1]))
bankData.head()
bankData.describe()
ex.pie(bankData, names='Gender', title='Proportion of Customer Genders')
There are slightly more female than male customers, but the difference is so small that it won't have a significant impact on the overall data analysis. For all intents and purposes, we can say that the genders are uniformly distributed.
ex.pie(bankData, names='Education_Level', title='Proportion of Education Levels')
We can see that the single largest group of customers have a graduate-level education, with the second largest group having a high-school education.
ex.pie(bankData, names='Marital_Status', title='Proportion of Marital Status')
From the graph above, we can see that the majority of customers are either married or single.
income = ex.pie(bankData, names='Income_Category', title='Proportion of Different Income Levels')
newNames = {'$40K - $60K': '$40K - 60K', '$60K - $80K': '$60K - 80K', '$80K - $120K': '$80K - 120K'}
for item in newNames:
    for i, elem in enumerate(income.data[0].labels):
        if elem == item:
            income.data[0].labels[i] = newNames[item]
income
From the graph above, we can see that the single largest income group is customers earning less than $40K a year.
ex.pie(bankData, names='Card_Category', title='Proportion of Different Card Categories')
From the graph above, we can see that an overwhelming majority of customers use the bank's "Blue" card.
ex.pie(bankData, names='Attrition_Flag', title='Proportion of Attrited vs Existing Customers')
Since the majority of the customer data we have is of existing customers, I will use SMOTE to up-sample the attrited class to match the existing-customer sample size. This balances out the skewed data and should also help improve the performance of the models selected later.
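As a quick sanity check (a minimal sketch using the raw labels, before any encoding), the class counts below make the imbalance explicit; in this dataset only about one in six customers is attrited:

# Class balance of the target before SMOTE (labels are still strings here)
print(bankData['Attrition_Flag'].value_counts())
print(bankData['Attrition_Flag'].value_counts(normalize=True).round(3))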
fig = plt.figure()
fig.subplots_adjust(hspace=0.8, wspace=0.5)
fig.set_size_inches(13.5, 15)
sb.set(font_scale = 1.25)
hists = ['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count',
         'Months_Inactive_12_mon', 'Credit_Limit', 'Total_Trans_Amt', 'Avg_Utilization_Ratio']
i = 1
# Note: distplot is deprecated in newer seaborn releases (histplot/displot
# replace it), but it is kept here to match the version printed above, since
# it can overlay a fitted normal curve directly.
for var in hists:
    fig.add_subplot(4, 2, i)
    sb.distplot(pd.Series(bankData[var], name=''),
                fit=norm, kde=False).set_title(var + " Histogram")
    plt.ylabel('Count')
    i += 1
fig.tight_layout()
fig = plt.figure()
fig.subplots_adjust(hspace=0.8, wspace=0.5)
fig.set_size_inches(13.5, 16)
sb.set(font_scale = 1.25)
boxs = ['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count',
        'Months_Inactive_12_mon', 'Credit_Limit', 'Total_Trans_Amt', 'Avg_Utilization_Ratio']
i = 1
for var in boxs:
    fig.add_subplot(8, 1, i)
    sb.boxplot(pd.Series(bankData[var], name='')).set_title(var + " Box Plot")
    i += 1
fig.tight_layout()
# Binary-encode the target and gender
bankData['Attrition_Flag'] = bankData['Attrition_Flag'].replace({'Attrited Customer':1, 'Existing Customer':0})
bankData['Gender'] = bankData['Gender'].replace({'F':1, 'M':0})
# One-hot encode the remaining categoricals, dropping one level from each
# ('Unknown', or 'Platinum' for the card category) as the reference column
bankData = pd.concat([bankData, pd.get_dummies(bankData['Education_Level']).drop(columns=['Unknown'])], axis=1)
bankData = pd.concat([bankData, pd.get_dummies(bankData['Income_Category']).drop(columns=['Unknown'])], axis=1)
bankData = pd.concat([bankData, pd.get_dummies(bankData['Marital_Status']).drop(columns=['Unknown'])], axis=1)
bankData = pd.concat([bankData, pd.get_dummies(bankData['Card_Category']).drop(columns=['Platinum'])], axis=1)
# Drop the original categorical columns and the client ID
bankData.drop(columns = ['Education_Level', 'Income_Category', 'Marital_Status', 'Card_Category', 'CLIENTNUM'], inplace=True)
print("The dimension of the data is: {:,} (rows) by {:,} (columns)".format(bankData.shape[0], bankData.shape[1]))
bankData.head()
fig = plt.figure()
fig.set_size_inches(30, 20)
sb.set(font_scale = 1)
sb.heatmap(bankData.corr('pearson'), annot=True)
From the above correlation matrix, we can see that there are now quite a few variables, and using all of them for modeling could be a problem. I will first up-sample the data to even out the class imbalance between attrited and existing customers, and then use PCA to reduce the number of encoded features in the dataset.
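To make the heatmap's takeaway concrete, the short sketch below (an illustrative helper, not part of the original analysis) lists the ten most strongly correlated feature pairs:

# Rank feature pairs by absolute Pearson correlation (upper triangle only,
# to avoid duplicates and the diagonal)
corr_abs = bankData.corr('pearson').abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(10))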
# Up-sample the minority (attrited) class with SMOTE; X holds the features
# and y the Attrition_Flag target (column 0)
smote_sample = SMOTE()
X, y = smote_sample.fit_resample(bankData[bankData.columns[1:]], bankData[bankData.columns[0]])
up_sampData = X.assign(Attrition = y)
# Separate the one-hot encoded columns; these will be compressed with PCA
encoded_cols = up_sampData[up_sampData.columns[15:-1]]
up_sampData = up_sampData.drop(columns=up_sampData.columns[15:-1])
Using principal component analysis to reduce the dimensionality of the encoded categorical variables will lose some of the variance in the data, but in exchange, using only a few of the principal components instead of all the encoded features will help construct a better model.
fig = plt.figure()
fig.set_size_inches(15, 12)
sb.set(font_scale = 1.25)
N_COMPONENTS = len(encoded_cols.columns)
pca = PCA(n_components = N_COMPONENTS)
pc_matrix = pca.fit_transform(encoded_cols)
evr = pca.explained_variance_ratio_ * 100
cumsum_evr = np.cumsum(evr)
ax = sb.lineplot(x=np.arange(1, len(cumsum_evr) + 1), y=cumsum_evr, label='Explained Variance Ratio')
ax.lines[0].set_linestyle('-.')
ax.set_title('Explained Variance Ratio Using {} Components'.format(N_COMPONENTS))
ax.plot(np.arange(1, len(cumsum_evr) + 1), cumsum_evr, 'bo')
for x, y in zip(range(1, len(cumsum_evr) + 1), cumsum_evr):
    plt.annotate("{:.2f}%".format(y), (x, y), xytext=(2, -15),
                 textcoords="offset points", annotation_clip = False)
ax = sb.lineplot(x=np.arange(1, len(cumsum_evr) + 1), y=evr, label='Explained Variance Of Component X')
ax.plot(np.arange(1, len(evr) + 1), evr,'ro')
ax.lines[1].set_linestyle('-.')
ax.set_xticks([i for i in range(1, len(cumsum_evr) + 1)])
for x, y in zip(range(1, len(cumsum_evr) + 1), evr):
    if x != 1:
        plt.annotate("{:.2f}%".format(y), (x, y), xytext=(2, 5),
                     textcoords="offset points", annotation_clip = False)
ax.set_xlabel('Component Number')
ax.set_ylabel('Explained Variance')
The graph above shows the explained variance of each PCA component, along with the cumulative sum across components. Looking at these values, I will use 8 of the 17 PCA components, because that cuts the number of encoded features by more than half while still explaining roughly 80% of the variance in the encoded data.
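As a quick check, the cumulative ratios computed above confirm the variance retained by that choice:

# Total variance explained by the first 8 principal components (~80%)
print('First 8 components explain {:.2f}% of the encoded variance'.format(cumsum_evr[7]))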
up_sampData_PCA = pd.concat([up_sampData,
                             pd.DataFrame(pc_matrix,
                                          columns=['PC-{}'.format(i) for i in range(1, N_COMPONENTS + 1)])], axis=1)
up_sampData_PCA = up_sampData_PCA[up_sampData_PCA.columns[:24]]
up_sampData_PCA.head()
fig = plt.figure()
fig.set_size_inches(20, 15)
sb.set(font_scale = 0.9)
sb.heatmap(up_sampData_PCA.corr('pearson'), annot=True)
The models I have selected to experiment with in this analysis are the following: Logistic Regression, XGB Classifier, Decision Tree Classifier, and Random Forest Classifier. The models' performances (recall score) on the training data will be compared at the end to see which model performed best, and the best model will then be used as the final model for predicting on the test set.
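Each of the four candidate models below is tuned with the same scale-then-grid-search pattern. The helper sketched here is only an illustration of that shared recipe (the `tune` name and the generic 'm' step are mine; the actual grids appear in the individual cells below):

def tune(model, grid, X, y):
    # Standardize features, then grid-search the estimator with 5-fold CV,
    # scoring each candidate by recall (keys in `grid` use the 'm__' prefix)
    pipe = Pipeline(steps=[('scale', StandardScaler()), ('m', model)])
    search = GridSearchCV(pipe, param_grid=grid, scoring=make_scorer(recall_score),
                          cv=5, n_jobs=-1)
    search.fit(X, y)
    return search.best_params_, search.best_score_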
seed = 74 # Seed for train/test split reproduction
x_train, x_test, y_train, y_test = train_test_split(up_sampData_PCA[up_sampData_PCA.columns.drop('Attrition')],
                                                    up_sampData_PCA['Attrition'],
                                                    train_size=0.65,
                                                    random_state=seed)
x_train.head()
y_train.head()
lr_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('lr', LogisticRegression(random_state=seed))
])
# Note: not every penalty/solver combination below is valid (e.g. 'l1' with
# 'newton-cg'); GridSearchCV records a NaN score for failed fits and moves on.
param_grid = {'lr__penalty': ['l1', 'l2', 'elasticnet', 'none'],
              'lr__fit_intercept': [True, False],
              'lr__class_weight': ['balanced', None],
              'lr__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
              'lr__max_iter': np.arange(100, 600, 100),
              'lr__warm_start': [True, False]}
lr_grid = GridSearchCV(lr_pipe, scoring=make_scorer(recall_score),
                       param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
lr_grid.fit(x_train, y_train)
lr_df = pd.DataFrame(lr_grid.cv_results_).sort_values('mean_test_score',
                                                      ascending=False)[['params', 'mean_test_score']].head(10)
lr_df
print('Best Logistic Regression Parameters\n' + '='*35)
for name, val in lr_df.iloc[0]['params'].items():
    print('{:>19}: {}'.format(name.replace('lr__', ''), val))
lr_recall = lr_df.iloc[0]['mean_test_score']
print('\nRecall Score: {}'.format(round(lr_recall, 4)))
xg_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('xg', XGBClassifier(random_state=seed))
])
param_grid = {'xg__use_label_encoder': [False],
              'xg__learning_rate': [0.05, 0.1, 0.2],
              'xg__eval_metric': ['logloss'],
              'xg__booster': ['gbtree', 'gblinear'],
              'xg__importance_type': ['gain', 'weight'],
              'xg__subsample': [0.8, 0.9, 1],
              'xg__colsample_bytree': [0.8, 0.9, 1],
              'xg__max_depth': [5, 6],
              'xg__reg_lambda': [0.1, 0.2]}
xg_grid = GridSearchCV(xg_pipe, scoring=make_scorer(recall_score),
                       param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
xg_grid.fit(x_train, y_train)
xg_df = pd.DataFrame(xg_grid.cv_results_).sort_values('mean_test_score',
                                                      ascending=False)[['params', 'mean_test_score']].head(10)
xg_df
print('Best XG Boost Classifier Parameters\n' + '='*35)
for name, val in xg_df.iloc[0]['params'].items():
    print('{:>19}: {}'.format(name.replace('xg__', ''), val))
xg_recall = xg_df.iloc[0]['mean_test_score']
print('\nRecall Score: {}'.format(round(xg_recall, 4)))
dt_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('dt', DecisionTreeClassifier(random_state=seed))
])
param_grid = {'dt__criterion': ['gini', 'entropy'],
              'dt__class_weight': ['balanced', None],
              'dt__splitter': ['best', 'random'],
              'dt__max_features': ['auto', 'sqrt', 'log2'],
              'dt__max_depth': [2, 4, 6],
              'dt__min_samples_leaf': [1, 2, 4],
              'dt__min_samples_split': [2, 4]}  # 1 is not valid; must be >= 2
dt_grid = GridSearchCV(dt_pipe, scoring=make_scorer(recall_score),
                       param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
dt_grid.fit(x_train, y_train)
dt_df = pd.DataFrame(dt_grid.cv_results_).sort_values('mean_test_score',
                                                      ascending=False)[['params', 'mean_test_score']].head(10)
dt_df
print('Best Decision Tree Classification Parameters\n' + '='*44)
for name, val in dt_df.iloc[0]['params'].items():
    print('{:>23}: {}'.format(name.replace('dt__', ''), val))
dt_recall = dt_df.iloc[0]['mean_test_score']
print('\nRecall Score: {}'.format(round(dt_recall, 4)))
rf_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=seed))
])
param_grid = {'rf__max_depth': [2, 4, 6],
              'rf__class_weight': ['balanced', 'balanced_subsample'],
              'rf__criterion': ['gini', 'entropy'],
              'rf__max_features': ['auto', 'sqrt', 'log2'],
              'rf__min_samples_leaf': [1, 2, 4],
              'rf__min_samples_split': [2, 5, 7],
              'rf__n_estimators': np.arange(100, 400, 100)}
rf_grid = GridSearchCV(rf_pipe, scoring=make_scorer(recall_score),
                       param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
rf_grid.fit(x_train, y_train)
rf_df = pd.DataFrame(rf_grid.cv_results_).sort_values('mean_test_score',
                                                      ascending=False)[['params', 'mean_test_score']].head(10)
rf_df
print('Best Random Forest Classification Parameters\n' + '='*44)
for name, val in rf_df.iloc[0]['params'].items():
    print('{:>19}: {}'.format(name.replace('rf__', ''), val))
rf_recall = rf_df.iloc[0]['mean_test_score']
print('\nRecall Score: {}'.format(round(rf_recall, 4)))
recall_scores = [lr_recall, xg_recall, dt_recall, rf_recall]
modelTypes = ['Logistic Regression', 'XG Boost Classifier', 'Decision Tree Classifier', 'Random Forest Classifier']
recall_df = pd.DataFrame(zip(modelTypes, recall_scores),
                         columns=['Model Type', 'Recall Score'])
recall_df = recall_df.nlargest(len(recall_df), 'Recall Score').reset_index(drop=True)
recall_df
From the above we can see that all of the models performed very well on the training data, with the best performing model being the XG Boost Classifier. As a result, the XG Boost Classifier will be used to make predictions on the test set for the final analysis and results.
print('Best XG Boost Classifier Parameters\n' + '='*35)
params = {}
for name, val in xg_df.iloc[0]['params'].items():
    name = name.replace('xg__', '')
    params.update({name: val})
    print('{:>21}: {}'.format(name, val))
xg_recall = xg_df.iloc[0]['mean_test_score']
print('\nRecall Score: {}'.format(round(xg_recall, 4)))
best_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('xg', XGBClassifier(**params, random_state=seed))
])
best_model = best_pipe.fit(x_train, y_train)
best_model
y_pred = best_model.predict(x_test)
best_model_score = recall_score(y_test, y_pred)
print("Best XG Boost Classifier score using the test data\n" + '='*50 +
"\nTest Recall Score: {}\n\nTrain Recall Score: {}".format(round(best_model_score, 4),
round(xg_recall, 4)))
print('\nDifference between train and best model test recall scores: {}'
.format(abs(round(best_model_score - xg_recall, 4))))
Since the test recall score is so close to the value I received during my training experiments, I am confident the model I have selected will perform well on future, unseen customer data.
encoded_cols = bankData[bankData.columns[16:]]
# Re-use the PCA already fitted on the up-sampled data (transform, not
# fit_transform) so the components match the ones the model was trained on
pc_matrix = pca.transform(encoded_cols)
orginData_PCA = pd.concat([bankData[bankData.columns.drop(encoded_cols.columns)],
                           pd.DataFrame(pc_matrix, columns=['PC-{}'.format(i) for i in range(1, N_COMPONENTS + 1)])], axis=1)
orginData_PCA = orginData_PCA[orginData_PCA.columns[:24]]
orginData_PCA_Pred = best_model.predict(orginData_PCA[orginData_PCA.columns[1:]])
print("Best XG Boost Classifier score using the Original Dataset\n" + '='*57 +
"\nRecall Score: {}".format(round(recall_score(orginData_PCA['Attrition_Flag'], orginData_PCA_Pred), 4)))
fig = plt.figure()
fig.set_size_inches(16, 10)
sb.set(font_scale = 1.5)
# confusion_matrix is called as (predictions, true labels) here, so the rows
# of the heatmap correspond to predictions, matching the y-axis labels below
conf = sb.heatmap(confusion_matrix(orginData_PCA_Pred, orginData_PCA['Attrition_Flag']),
                  annot=True, cmap='coolwarm', fmt='d')
conf.set_title('Prediction On Original Data With XG Boost Classifier Model Confusion Matrix')
conf.set_xticklabels(['Not Attrited', 'Attrited'])
conf.set_yticklabels(['Predicted Not Attrited', 'Predicted Attrited'])
sb.set(font_scale = 1.5)
orginData_PCA_Proba = best_model.predict_proba(orginData_PCA[orginData_PCA.columns[1:]])
skplt.metrics.plot_precision_recall(orginData_PCA['Attrition_Flag'], orginData_PCA_Proba, figsize=[16, 10])
From the above confusion matrix and precision-recall curve, it is evident that the XG Boost Classifier (with tuned hyperparameters) performed very well and made very good predictions on both the test set and the original dataset (without up-sampling). Given all of the analysis and the final results, I am confident that this XG Boost Classifier model will perform well for the bank in predicting credit card customer attrition.
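To put the model to work on future customer data, the fitted pipeline can be persisted and reloaded; a minimal sketch using joblib (the filename is illustrative):

import joblib

# Persist the fitted scaler + XGB pipeline, then reload it for later scoring
joblib.dump(best_model, 'churn_xgb_pipeline.joblib')
reloaded_model = joblib.load('churn_xgb_pipeline.joblib')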